Fallback to bootstrapping #3480

yacovm · 2024-10-17T22:04:01Z

Why this should be merged

How this works

How this was tested

The lifecycle of an avalanchego node can be separated to three main stages: Stage I - If supported by the VM, then a state sync occurs. Otherwise, this stage is skipped. In the X-chain, this stage is entirely substituted by a custom bootstrapping phase that replicates and executes the X-chain DAG. Stage II - A stage called bootstrapping, which entails replicating the blocks from other avalanchego nodes, and executing them. Stage III - The final form of an avalanchego node, in which it runs the snowman consensus protocol. The phases are implemented by components called "engines": - avalanche bootstrapping - snowman - state syncer - snowman bootstrapper And the handler in snow/networking/handler/ is responsible to route messages from the network into the correct engine. Engines all implement the same common.Engine interface, but the interface consists of the union of all operations across all engines. Indeed, it is often the case that a message of type `m` dispatched by engine `e`, cannot be processed by a different engine `e'. For instance, a Chits message cannot be processed by any engine other than consensus, and a message about a query of state summary is only relevant to the state sync engine. To that end, each engine simply implements a no-op dispatcher for messages it does not care about. The biggest problem with the existing aforementioned structure, is that the lifecycle of the engines imposes a strict one way movement across the various stages, and there is no single component which consolidates the transition among the stages. The movement between the various stages takes place by a callback given to each engine at every stage but the final one (Stage I and Stage II). The structure therefore makes it difficult to introduce movement from snowman consensus back to bootstrapping / state sync, or to have better control over the message dispatch. This commit unifies all engines into a single one under snow/engine/unified As a result, the implementation of the handler in snow/networking/handler/handler.go is now simpler, as it only interacts with the unified engine, and it never needs to query the snow.EngineState. The state transition between the various stages is now taken care of by the unified engine, and since the code to dispatch messages to the right engine is now all in the unified engine, it's not only more testable but it lets us move among the stages in the same place where we consider the stage we're in to dispatch the message to the correct engine. Signed-off-by: Yacov Manevich <[email protected]>

Signed-off-by: Joshua Kim <[email protected]>

…ava-labs#3461)

Co-authored-by: Darioush Jalali <[email protected]>

This commit adds a mechanism that detects that our node is behind the majority of the stake. The intent is to later have this mechanism be the trigger for the bootstrapping mechanism. Currently, the bootstrapping mechanism is only active upon node boot, but not at a later point. The mechanism works in the following manner: - It intercepts the snowman engine's Chits message handling, and upon every reception of the Chits message, the mechanism that detects if the node is a straggler (a node with a ledger height behind the rest) may be invoked, if it wasn't invoked too recently. - The mechanism draws statistics from the validators known to it, and computes the latest accepted block for each validator. - The mechanism then proceeds to determine which blocks are pending to be processed (a block pending to be processed was not accepted). - The mechanism then collects a snapshot of all blocks it hasn't accepted yet, and the amount of stake that has accepted this block. - The mechanism then waits for its next invocation, in order to see if it has accepted blocks correlated with enough stake. - If there is too much stake that has accepted blocks by other nodes correlated to it that the node hasn't accepted, then the mechanism announces the node is behind, and returns the time period between the two invocations. - The mechanism sums the total time it has detected the node is behind, until a sampling concludes it is not behind, and then the total time is nullified. Signed-off-by: Yacov Manevich <[email protected]>

Signed-off-by: Yacov Manevich <[email protected]>

marun · 2024-10-18T01:24:49Z

If this PR is based on Unify engines #3405, please change the base from master to show only the new commits.
Is it really necessary for these PRs to be so big? The scale here is large enough to make diligent review very difficult and the effort required of your reviewers will be considerable.

yacovm · 2024-10-18T12:35:51Z

If this PR is based on Unify engines #3405, please change the base from master to show only the new commits.

Yeah I need to tidy up the commit chain. I didn't mean to post this for review, I forgot to put it in draft mode.

Is it really necessary for these PRs to be so big? The scale here is large enough to make diligent review very difficult and the effort required of your reviewers will be considerable.

I can split the PRs to pieces, that's not an issue. I just got the PR you mention working two days ago, so I didn't think it would be reviewed by anyone anytime soon :-)

Signed-off-by: Yacov Manevich <[email protected]>

github-actions · 2024-11-24T00:00:29Z

This PR has become stale because it has been open for 30 days with no activity. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.

yacovm and others added 12 commits October 16, 2024 18:34

Remove unused meteredHandler struct type (ava-labs#3450)

a3a4aaa

Signed-off-by: Joshua Kim <[email protected]>

Add ACP-118 caching support (ava-labs#3451)

d891af5

[testing] Enable verbose image build output for bootstrap-monitor e2e (…

d02a375

…ava-labs#3461)

Remove unexpected block unwrapping (ava-labs#3455)

c332ec7

[ci] Update MacOS jobs to macos-14 (ava-labs#3462)

18a3c62

ACP-77: Include validator capacity into the fee config (ava-labs#3464)

7817c28

Update versions for v1.11.12 (ava-labs#3470)

ee6f5bb

Co-authored-by: Darioush Jalali <[email protected]>

Rebase on top of unified engine

8845cf1

Signed-off-by: Yacov Manevich <[email protected]>

Fallback to bootstrapping when detected we're behind

31d0184

Signed-off-by: Yacov Manevich <[email protected]>

Only set VM's state once

4758bc0

Signed-off-by: Yacov Manevich <[email protected]>

yacovm requested review from marun, joshua-kim and StephenButtolph as code owners October 17, 2024 22:04

yacovm marked this pull request as draft October 18, 2024 12:35

yacovm force-pushed the fallbackToBootstrapping branch 5 times, most recently from 3eb6fb1 to c906bc3 Compare October 18, 2024 20:16

shutdown VMs

9d8a7b1

Signed-off-by: Yacov Manevich <[email protected]>

yacovm force-pushed the fallbackToBootstrapping branch from c906bc3 to 9d8a7b1 Compare October 18, 2024 20:33

github-actions bot added the lifecycle/stale label Nov 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fallback to bootstrapping #3480

Fallback to bootstrapping #3480

yacovm commented Oct 17, 2024

marun commented Oct 18, 2024

yacovm commented Oct 18, 2024

github-actions bot commented Nov 24, 2024

Fallback to bootstrapping #3480

Are you sure you want to change the base?

Fallback to bootstrapping #3480

Conversation

yacovm commented Oct 17, 2024

Why this should be merged

How this works

How this was tested

marun commented Oct 18, 2024

yacovm commented Oct 18, 2024

github-actions bot commented Nov 24, 2024